{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "第四周作业：对活动进行聚类\n",
    "\n",
    "第四周和第五周的作业均数据来源于Kaggle竞赛：Event Recommendation Engine Challenge，根据\n",
    "events they’ve responded to in the past\n",
    "user demographic information\n",
    "what events they’ve seen and clicked on in our app\n",
    "用户对某个事件是否感兴趣\n",
    "\n",
    "竞赛官网：\n",
    "https://www.kaggle.com/c/event-recommendation-engine-challenge/data\n",
    "\n",
    "活动描述信息在events.csv文件：共110维特征\n",
    "前9列：event_id, user_id, start_time, city, state, zip, country, lat, and lng.\n",
    "event_id：活动的id, \n",
    "user_id：的id .  \n",
    "city, state, zip, and country： 活动地点 (如果知道的话).\n",
    "lat and lng： floats（活动地点的经度和纬度）\n",
    "start_time： 字符串，ISO-8601 UTC t创建活动的用户ime，表示活动开始时间\n",
    "\n",
    "后101列为词频：count_1, count_2, ..., count_100，count_other\n",
    "count_N：活动描述出现第N个词的次数\n",
    "count_other：除了最常用的100个词之外的其余词出现的次数\n",
    "\n",
    "作业要求：\n",
    "根据活动的关键词（count_1, count_2, ..., count_100，count_other属性）做聚类，可采用KMeans聚类\n",
    "尝试K=10，20，30，..., 100, 并计算各自CH_scores。\n",
    "\n",
    "提示：由于样本数目较多，建议使用MiniBatchKMeans。\n",
    "\n",
    "文件说明：\n",
    "1.可以先运行0. EDA.ipynb，看一下竞赛所有数据的情况；\n",
    "总体活动的数目太多（300w+记录），可以只需对训练集train.csv和测试集test.cv出现的活动（13418条记录）举行聚类即可。运行1. Users_Events.ipynb可得到只在训练集train.csv和测试集test.cv出现的活动，可自己修改代码存为csv格式，在进行聚类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 181,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy     as np       #--\n",
    "import pandas    as pd   #数据处理\n",
    "\n",
    "from matplotlib  import pyplot #--\n",
    "import matplotlib.pyplot as  plt\n",
    "import seaborn   as sns        #可视化\n",
    "\n",
    "from sklearn.model_selection import train_test_split   #分割数据\n",
    "from sklearn.decomposition import PCA                  #数据预处理之降维\n",
    "from sklearn.cluster import MiniBatchKMeans            #基于相似度，距离的聚类算法\n",
    "from sklearn import metrics                             #评价\n",
    "import scipy.sparse as ss\n",
    "from collections import defaultdict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据读取"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 15398 entries, 0 to 15397\n",
      "Data columns (total 6 columns):\n",
      "user              15398 non-null int64\n",
      "event             15398 non-null int64\n",
      "invited           15398 non-null int64\n",
      "timestamp         15398 non-null object\n",
      "interested        15398 non-null int64\n",
      "not_interested    15398 non-null int64\n",
      "dtypes: int64(5), object(1)\n",
      "memory usage: 721.9+ KB\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(15398, 6)"
      ]
     },
     "execution_count": 123,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train = pd.read_csv(\"train.csv\")\n",
    "test = pd.read_csv(\"test.csv\")\n",
    "train.info()\n",
    "train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 10237 entries, 0 to 10236\n",
      "Data columns (total 4 columns):\n",
      "user         10237 non-null int64\n",
      "event        10237 non-null int64\n",
      "invited      10237 non-null int64\n",
      "timestamp    10237 non-null object\n",
      "dtypes: int64(3), object(1)\n",
      "memory usage: 320.0+ KB\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(10237, 4)"
      ]
     },
     "execution_count": 124,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test.info()\n",
    "test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "观察得知：无缺失值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "抽取出只在训练集和测试集中出现的event"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "numUniqueEvents = {13418}\n",
      "(1, 13418)\n"
     ]
    }
   ],
   "source": [
    "uniqueUsers  = set()                   #创建无序不重复的对象\n",
    "uniqueEvents = set()                   #--\n",
    "\n",
    "# eventsForUser = defaultdict(set)\n",
    "# usersForEvent = defaultdict(set)\n",
    "\n",
    "fileNameSum = [\"train.csv\",\"test.csv\"]\n",
    "for fileName in fileNameSum:\n",
    "    f = open(fileName,\"r\")# 打开，读取\n",
    "    \n",
    "    f.readline().strip().split(\",\")  #将文件指针移向第二行\n",
    "    for line in f:\n",
    "        cols = line.strip().split(\",\")\n",
    "        # uniqueUsers.add(cols[0])     #整合无序不重复序列\n",
    "        uniqueEvents.add(cols[1])\n",
    "    f.close()\n",
    "    \n",
    "numUniqueEvents= len(uniqueEvents)\n",
    "\n",
    "print(\"numUniqueEvents = {%d}\"% numUniqueEvents)\n",
    "\n",
    "df =[uniqueEvents]\n",
    "ClusterTheActivitiesDF = pd.DataFrame(df)\n",
    "print(ClusterTheActivitiesDF.shape)\n",
    "\n",
    "ClusterTheActivitiesDF.index +=1\n",
    "\n",
    "ClusterTheActivitiesDF.index.name = \"ID\"\n",
    "ClusterTheActivitiesDF.to_csv('uniqueEvents.csv', header=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1, 13419)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([[         1],\n",
       "       [2919165814],\n",
       "       [3684919275],\n",
       "       ...,\n",
       "       [1187251277],\n",
       "       [3795482697],\n",
       "       [2794906451]], dtype=int64)"
      ]
     },
     "execution_count": 138,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainTestEvents = pd.read_csv(\"uniqueEvents.csv\")\n",
    "print(trainTestEvents.shape)\n",
    "Events = np.array(trainTestEvents)\n",
    "Events.reshape((13419,1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现已抽取出只在训练集和测试集中出现的event"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在此cell之前已对训练集和测试集进行数据读取:train,(15398, 6);test,(10237, 4);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 169,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(15398, 3)\n"
     ]
    }
   ],
   "source": [
    "#y_train = [train.interested.values,train.not_interested.values]\n",
    "y_train = train.interested.values\n",
    "f_drop = [\"interested\",\"not_interested\",\"timestamp\"]\n",
    "X_train = train.drop(f_drop,axis = 1)\n",
    "print(X_train.shape)\n",
    "X_test = test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 170,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(15398, 2)\n"
     ]
    }
   ],
   "source": [
    "#降维三部曲\n",
    "#print(train_test.shape)\n",
    "Pca = PCA(n_components = 0.75)#实例\n",
    "Pca.fit(X_train)#训练\n",
    "X_train_pca = Pca.transform(X_train)#降维\n",
    "print(X_train_pca.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 171,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\Dell\\Documents\\anaconda\\lib\\site-packages\\sklearn\\model_selection\\_split.py:2026: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.\n",
      "  FutureWarning)\n"
     ]
    }
   ],
   "source": [
    "X_train_part, X_val, y_train_part, y_val = train_test_split(X_train_pca,y_train, train_size = 0.8,random_state = 33)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 172,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(12318, 2)\n",
      "(3080, 2)\n"
     ]
    }
   ],
   "source": [
    "print(X_train_part.shape)\n",
    "print(X_val.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面为一个参数点（k）的模型，且在校验集上评价聚类算法性能"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 177,
   "metadata": {},
   "outputs": [],
   "source": [
    "#自定义函数,对每一个参数评价他的性能\n",
    "def kClusterAnalysis(K,X_train,y_train,X_val,y_val):\n",
    "    print(\"k == {%d}\"% K)\n",
    "    \n",
    "    mbKmeans = MiniBatchKMeans(n_clusters = K)#产生一个聚类实例\n",
    "    mbKmeans.fit(X_train)#训练\n",
    "    y_val_pre = mbKmeans.predict(X_val)\n",
    "    \n",
    "    #看一下轮廓系数，内部评价指标，在训练数据上算，因为不需要校验的\n",
    "    CH_score = metrics.silhouette_score(X_train,mbKmeans.predict(X_train))\n",
    "    \n",
    "    #因为是现在是有监督的情况，所以在校验集上进行校验\n",
    "    v_score = metrics.v_measure_score(y_val, y_val_pre)\n",
    "    print(\"CH_score: {}\".format(CH_score))\n",
    "    print(\"v_score: {}\".format(v_score))\n",
    "    \n",
    "    return CH_score,v_score#内部评价指标，外部评价指标，可以打出来看一下，"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 178,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "k == {10}\n",
      "CH_score: 0.3274120715404184\n",
      "v_score: 0.001230418051936614\n",
      "k == {20}\n",
      "CH_score: 0.3341025195248938\n",
      "v_score: 0.0024774369503874523\n",
      "k == {30}\n",
      "CH_score: 0.33827507309167504\n",
      "v_score: 0.002793192223375878\n",
      "k == {40}\n",
      "CH_score: 0.3278046246470313\n",
      "v_score: 0.00441750535240075\n",
      "k == {50}\n",
      "CH_score: 0.32452013749323544\n",
      "v_score: 0.004411201725890477\n",
      "k == {60}\n",
      "CH_score: 0.3295620318448941\n",
      "v_score: 0.005378284696389429\n",
      "k == {70}\n",
      "CH_score: 0.3219277773462703\n",
      "v_score: 0.006146618100832908\n",
      "k == {80}\n",
      "CH_score: 0.32271673148947905\n",
      "v_score: 0.006414208933969092\n",
      "k == {90}\n",
      "CH_score: 0.3064867230893742\n",
      "v_score: 0.0063979949997194875\n",
      "k == {100}\n",
      "CH_score: 0.327683207160457\n",
      "v_score: 0.008170996261580948\n"
     ]
    }
   ],
   "source": [
    "Ks = [10, 20, 30,40,50,60,70,80,90,100]#因为手写体数字一共有10类数值，所以是10的数字\n",
    "CH_scores = []\n",
    "v_scores = []\n",
    "for K in Ks:\n",
    "    ch,v = kClusterAnalysis(K, X_train_part, y_train_part, X_val, y_val)\n",
    "    CH_scores.append(ch)\n",
    "    v_scores.append(v)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 182,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<matplotlib.lines.Line2D at 0x25c8559a470>]"
      ]
     },
     "execution_count": 182,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYQAAAD8CAYAAAB3u9PLAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3Xm8VHX9x/HXh10RRQVNBQMVF5REvaiFmmgaprmk3NRSy4xMCKufC6ammZZZ2u+nomZpaoaIC4q561wXcImLYAiIgJhcN9BwwY3t8/vjc24McOEO3Jk5s7yfj8d9DOfcM+d8Zhznc8/nu5m7IyIi0irtAEREpDQoIYiICKCEICIiCSUEEREBlBBERCShhCAiIoASgoiIJJQQREQEUEIQEZFEm7QDWBtdunTxHj16pB2GiEhZmThx4rvu3rW548oqIfTo0YP6+vq0wxARKStm9u9cjlPJSEREACUEERFJKCGIiAighCAiIgklBBERAZQQREQkoYQgIiKAEkL1+egj+NOfYOHCtCMRkRJTVgPTpIVeegmOPhpeeQXmzIFLL007IhEpIbpDqBZ/+xvsuSd88AHstx9ceSW8/XbaUYlICVFCqHSffQanngonnhgJYdIkuOEGWLQIfvObtKMTkRKihFDJ5syB/v2jzWD4cHjsMdhiC9huOzj55Nj/+utpRykiJUIJoVLddx/svjvMng333gu//S20yWoyOv/8eLzoonTiE5GSo4RQaZYsgXPOgcMPh222gRdeiH+vrHv3KCXddBPMnFn0MEWk9CghVJK334aDDoreQ4MHw/jxkRRW5xe/gPbt4YILihejiJQsJYRK8dRTUSJ6/nm4+eZoH+jQYc3P2XxzGDYMRo2CKVOKE6eIlCwlhHLnDr//PRxwAGywQSSEE0/M/flnngmdOi1vUxCRqqWEUM7efx+OOgrOOise6+uhT5+1O8cmm8AZZ0TD8z//WZg4RaQsKCGUq0mTYI894P774X//F0aPhg03XLdz/fSn0KULnHdefmMUkbKihFBu3GNg2Ze/DJ9/Hm0Hp58OZut+zk6dYpzCo4/Ck0/mL1YRKStKCOXkk09iQNkpp8T0E5MmRWLIh9NOgy23hHPPjaQjIlVHCaFczJwZX/433xzdRB98ELp2zd/511svSkbjx8NDD+XvvCJSNpQQysFdd0V7QUMDPPAAXHghtG6d/+v84AfQo0ckBt0liFQdJYRStngx/PzncMwxsNNOUSIaOLBw12vXLpLNCy/A3XcX7joiUpKUEErVG2/AgAHwxz/CT34CTz8NW29d+Ot+97uw444xLmHp0sJfT0RKRk4JwcwGmtkMM5tlZsOb+P2pZjbFzCab2Tgz653s3zPZN9nMXjSzo7Ke81rWc+rz95IqwOOPw267weTJcNttsXZBu3bFuXbr1jHh3fTpMHJkca4pIiXBvJlasZm1Bl4BDgIagAnAce4+LeuYDd39w+TfhwOnuftAM1sfWOTuS8xsC+BFYMtk+zWgxt3fzTXYmpoar6+v4NyxbFmsUfDLX0aJ6M474zGNOPbYAz78EF5+Gdq2LX4MIpI3ZjbR3WuaOy6XO4Q9gVnu/qq7LwJGAUdkH9CYDBIdAU/2f+LuS5L9HRr3SxPeew8OOyxKNccfH6OG00gGAK1awcUXw6uvwo03phODiBRdLglhK2Bu1nZDsm8FZjbEzGYDlwHDsvbvZWZTgSnAqVkJwoFHzGyimQ1e1xdQEf75z5iY7vHH4dprY7nLjh3Tjekb34hurr/+day6JiIVL5eE0NQQ2FX+0nf3Ee6+LXA2cF7W/ufdfWegH3COmTVOwdnf3XcHDgGGmNl+TV7cbLCZ1ZtZ/fz583MIt4y4wzXXwD77xEjjceNijYKWjDrOFzO45JJo3L7uurSjEZEiyCUhNADds7a7AW+u4fhRwJEr73T36cDHwC7J9pvJ4zxgDFGaWoW7X+/uNe5e0zWfA7HStnBh9OgZMgQOPji6evbrl3ZUKxowAA48MNo1Fi5MOxoRKbBcEsIEoJeZ9TSzdsCxwNjsA8ysV9bmocDMZH9PM2uT/PuLwA7Aa2bW0cw6Jfs7AgcDL7X0xZSN6dNjwftRo+Kv8LFjY9bRUnTJJTB/fvR0EpGK1qa5A5IeQUOBh4HWwI3uPtXMLgLq3X0sMNTMvgYsBhYAJyVP3wcYbmaLgWVE76N3zWwbYIxFaaQNMNLdq2O+hNtugx/+MNoIHn001jEoZXvtBd/8Zqy5cNpp0Llz2hGJSIE02+20lJR1t1P3mGb6yiujzWDUKNhqlbb50vTii9C3b0x8d/HFaUcjImspn91OJR9+85tIBqefDplM+SQDgF13hW9/O9ZdmDcv7WhEpECUEIph7NiYMO6EE2IqinIc6PWrX8Gnn8Kll6YdiYgUiBJCoU2bFr2J+vWLhe9LoUvputhhBzjppOgm29CQdjQiUgBKCIW0YAEccUQ0II8ZE2sOlLNf/jKmtVA7gkhFUkIolCVL4Nhj4fXXYyrpcmozWJ0ePWDw4FjC89VX045GRPJMCaFQhg+HRx6JqSjytcxlKTj33GgDufDCtCMRkTxTQiiEv/0NLr881jE4+eS0o8mvLbaAoUPh1lujfUREKoYSQr5NmBADzwYMiKRQic4+GzbYINoURKRiKCHk01tvwZFHxl/Ro0eXZ/fSXGy6aSztedddMQeTiFQEJYR8+fxzOPpoeP99uPde6NIl7YgK62c/g403jvEVIlIRlBDywT1mLX32WbjlFvjSl9KOqPA22ihKRw8+COPHpx2NiOSBEkI+jBgRXTHPPz/uEqrF0KGw+ebR86iM5sQSkaYpIbRUJhOT1h1xRPV1xezYMZLBk0/CY4+lHY2ItJASQkvMmQODBsGOO0ZX01ZV+HYOHgxbb627BJEKUIXfYHmycGHcFbhHI3KnTmlHlI727aP76YQJMYmfiJQtJYR1sWwZfO97MHUq3H47bLtt2hGl66SToFevaENZtiztaERkHSkhrItLLok++H/4Axx0UNrRpK9Nm5gee8qUSJAiUpa0YtrauvfeGHx24olw003lO511vi1bFquqffZZTGnRptnVWUWkSLRiWiFMnVoZaxsUQqtW8Otfw8yZcPPNaUcjIutACSFX//lPNCJvsEGsbdChQ9oRlZ7DD4c994SLLoqR2yJSVpQQctG4tsHcuZWztkEhmMXiOa+/Dtdfn3Y0IrKWlBBycdZZ8OijcN11lbW2QSF87Wvw1a9Gw/vHH6cdjYishZwSgpkNNLMZZjbLzIY38ftTzWyKmU02s3Fm1jvZv2eyb7KZvWhmR+V6zpJx883wxz/CsGHw/e+nHU3pM4tk8M47cPXVaUcjImuh2V5GZtYaeAU4CGgAJgDHufu0rGM2dPcPk38fDpzm7gPNbH1gkbsvMbMtgBeBLQFv7pxNKXovo+efj792+/eHhx6q3OmsC+Eb34DnnovR3BttlHY0IlUtn72M9gRmufur7r4IGAUckX1AYzJIdCS+8HH3T9x9SbK/Q+P+XM6ZujffhKOOgi23rOy1DQrl4othwQK44oq0IxGRHOWSELYC5mZtNyT7VmBmQ8xsNnAZMCxr/15mNhWYApyaJIiczpmazz6Db30LPvwwxh1sumnaEZWf3XePmV+vuALefTftaEQkB7kkhKY6269SZ3L3Ee6+LXA2cF7W/ufdfWegH3COmXXI9ZwAZjbYzOrNrH7+/Pk5hNtC7vDjH0e56JZboE+fwl+zUl10UTQs/+53aUciIjnIJSE0AN2ztrsBb67h+FHAkSvvdPfpwMfALmtzTne/3t1r3L2ma9euOYTbQlddFSOQL7gg7hJk3fXuHQP5rr46SnCyenPnRnddkRTlkhAmAL3MrKeZtQOOBVaY1tLMemVtHgrMTPb3NLM2yb+/COwAvJbLOVPx+OOxVvCRR2oB+Xy58MIYx3HJJWlHUrqefhp22SUG9RXjLlhkNZpNCEnNfyjwMDAdGO3uU83soqRHEcBQM5tqZpOBnwMnJfv3AV5M9o8heh+9u7pz5vWVra1XX4Xa2ljb4JZbqnNtg0LYZhv4wQ/gz3+G115LO5rSc//9cPDBsNlmsR73D36gdSUkNZrcDuCjj+ArX4E33oh5/at9Out8a2iA7baD446Dv/417WhKx9//HtOo9+0LDzwAI0fG6nvXXBPtWCJ5osntcrVsWcznP21adC9VMsi/bt3gtNPizuvll9OOpjRcfXW0r+y7byzD2rUr/OQn8PWvR9ly+vS0I5QqpITw61/HZHWXXx7TLkhhDB8O660XjfXVzD3WjvjJT2KyxAceWL7aXqtWcQe1wQZw/PGaIFCKrroTwpgx0eh50klw+ulpR1PZNtssyiGjR8PkyWlHk45ly+I9uPDCKBXdeeeqs+ZusQXceGO8R+eem0aUUsWqNyFMmQInnAB77RWT1mltg8I74wzo3Lk6e3AtXhx/eFx5JfzsZ3DDDatfROib34w2hMsvh8ceK26cUtWqMyG8917crm+4YUxnrbUNiqNzZzjzTLjvvpjnqFp8+mmM2r711pjS4/LLm+/F9oc/wE47xcp8771XnDil6lVfQliyBL797ehRNGZMzFUkxTNsWJSPzjuv+WMrwQcfwMCB8I9/RO+hc8/N7W50/fWj19G778Ipp6grqhRF9SWEM8+MAWjXXx/lIimuDTaAc86J/wZ1dWlHU1jz5sGAAfDMM/HlvrZdSfv2hd/+Fu65B/7yl8LEKJKlusYh3HRTrGnw05/GGgeSjs8+g169ojvqU09V5kyy//53DDibOxfuugsOOWTdzrNsWXRFfeYZeOEF2GGH/MYpVUHjEFb23HPwox/BgQfC73+fdjTVrUOHmPjuuedgt93gySfTjii/pk+HffaJRYIefXTdkwFEW8PNN8d79p3vwKJF+YtTysODD8YI9gULCn6p6kgIb74ZE9V16wa337763h1SPN/7XpRCFi6E/fePHl9vv512VC1XXx+DzRYvjkTXv3/Lz7nlltEraeLE6uyhVe3uvTe6KDeOVymgyk8IixfHQjda26C0mEVPr2nToqF19Ogoh1x1VTT8l6O6umgz6NQJxo2DXXfN37mPPBIGD4bLLqv8thdZUSYD++1XlD9kKz8htGkTf33+7W8xo6SUlvXXj66YU6ZEI/+wYdCvX9TMy8k990Rp6ItfjGSw3Xb5v8YVV0TbywknwH/+k//zS+lpaICZM+GAA4pyucpPCGYwdGjcJUjp2n57ePhhuOOO6GrZvz+cfHJ5TAd9000xzqBv32gk36pAi/917Bi9lebNi7uFMuoQIuuo8W5QCUGqjhkcc0w0yp51VtzVbb89XHstLF2adnRNu+KK6Ll24IExqniTTQp7vT32iDuqu+7SzLHVIJOJMneRVm5UQpDSs8EGsezmv/4VvZBOOw323jumJi8V7jG47n/+J5LYffdF3MVwxhnRVjFsWJQTpDK5R0LYf/+irc+ihCCla6edYgDbyJExsnyvvaLrcNpTOSxdGknqkktiFPGoUdC+ffGu36pVTCXerl10RV28uHjXluKZMyeWVS1SuQiUEKTUmcXCOi+/HAMKb7gheiPdcEMM2iq2RYviS/i66+Dss2PEe+vWxY+jW7dYhW7ChJg9VSpPJhOPAwYU7ZJKCFIeNtww6vWTJkHv3vGXef/+sV0sn3wSXWVvvz1KWpdemu4suUcfHQ3vv/1tNGZLZclk4AtfiGV9i0QJQcpLnz4x4OuWW2Id7JqaWGzm/fcLe90FC+Cgg+CRR+Iv87POKuz1cvV//xer/H33u4V/D6R4GtsPDjigqH90KCFI+TGLvvgzZkQt/5proox0yy2F6Yr59tvRsFdfH3cHp5yS/2usqw02iDaWt96CU09VV9RK8fLLMfVJEctFoIQg5axz5xjZXF8P22wTC9Dst18McsuXOXNiXqLZs+H++6NHUanp1y+W5bz99uiqK+Wvsf2giA3KoIQglWC33WD8+Ghofvnl2P7Zz2K6kpZ46aVop/jPf6K3UymvuX322ZEMhwyJ5CXlLZOJUe89exb1sjklBDMbaGYzzGyWmQ1v4venmtkUM5tsZuPMrHey/yAzm5j8bqKZHZD1nCeSc05OfjbL38uSqtOqVTSwzpgBP/xh1NZ33BFuu23dyijPPRdfsGbw9NOlv3ZG69Zxd9C6dbQnqCtq+Vq2DJ54oujtB5BDQjCz1sAI4BCgN3Bc4xd+lpHu3sfd+wKXAVck+98FvunufYCTgJXvZ7/j7n2Tn3kteSEiQIwUvvZaeP75mELi+ONjFPG0abmf49FH4zmbbBLzEu28c+Hizaett4Y//SmS2cUXpx2NrKt//SvuSovcfgC53SHsCcxy91fdfREwCjgi+wB3z7437wh4sn+Su7+Z7J8KdDCzIo7gkarVr198MV53HUyeHDOPnn12TLe9JnfeCYceGpPTjRtX9Fv2Fvv2t6Mt5eKLo4wm5SeF8QeNckkIWwFzs7Ybkn0rMLMhZjabuEMY1sR5jgYmufvnWfv+mpSLzjdLs0O3VKTWrWNk84wZsVj9ZZfF6Oc772y6jPTnP8cX6p57RtfWL3yh+DHnw5VXQo8eMYDugw/SjkbWVl1dzOHVrVvRL51LQmjqi3qV/5vcfYS7bwucDaywgrqZ7Qz8DvhR1u7vJKWkfZOfE5q8uNlgM6s3s/r55TDzpZSerl2jwfmZZ6BLFxg0KBa+f+WV5cf87ncxg+jXvx5jDTp3Ti/eltpwQ/j732Pq5CFD0o5G1saSJfHHSAp3B5BbQmgAumdtdwPeXM2xECWlIxs3zKwbMAY40d3/2/3B3d9IHj8CRhKlqVW4+/XuXuPuNV27ds0hXJHV+PKXY6qHq66KNoY+fWKCujPPhOHDY4qMe+6JNRrK3d57wwUXRGL4+9/TjkZyNXEifPRR0bubNsolIUwAeplZTzNrBxwLjM0+wMx6ZW0eCsxM9ncG7gfOcffxWce3MbMuyb/bAocBL7XkhYjkpE2bWB9jxowoD11yCfzhDzHA7dZbY8K4SnHOOdFt9rTTYjyFlL7G9Q/23z+VyzebENx9CTAUeBiYDox296lmdpGZHZ4cNtTMpprZZODnRI8ikudtB5y/UvfS9sDDZvYvYDLwBvDnvL4ykTXZfPMY2fz007GuwNVXF22K4aJp0yaSHMTI7nJdmrSaZDKxsuNm6fTCNy+joe41NTVeX1+fdhgi5WXkyGhg/tWv4Je/TDsaWZ3PP4eNN14+jiaPzGyiu9c0d1yF/UkkIqs4/vhICBddBM8+m3Y0sjr//Cd8+mlq7QeghCBSHUaMgO7dIzG0dEoPKYxMJkYm77dfaiEoIYhUg402ivaEf/87pguX0pPJwO67R9koJUoIItWif//oZnvLLbHsp5SOTz6JkfUplotACUGkupx/foxROPXUWK9XSsMzz8TyrCkNSGukhCBSTdq0iYFqS5fGrKhLl6YdkUCUi9q0ibU3UqSEIFJtttkmGpmffjqm7JD0ZTIxh1anTqmGoYQgUo1OOCFGal9wQXR3lPR8+GGs+pdyuQigTdoBiEgKzGJq8Gefja6okybF+sxp+OyzWOVt5szlPw0NcOml8KUvpRNTMT39dJTuUm5QBiUEkerVuXOssjZgAJx+eswIWyiLFsV8Stlf+q+8Eo9z5644HXnXrjFt91VXxZTklS6TgfbtY/LFlCkhiFSz/faLSfAuuQQOOQSOOWbdz7VkSYxzWPkLf+bM2J/dgN25c8z5v+++0KvXij+dO8ddy5gxcM010LZty19nKauri2Sw3nppR6KEIFL1Lrgg1oAYPDjWju7effXHLlsWf9Gv/IU/c2bcAWSv5dypU3zB9+sX02dkf+lvuuma1wuurY05mOrq4OCD8/daS81778WKfr/6VdqRAEoIItK2bXz59u0bK8s99hi8/XbTX/qzZ8ckbI3WWy++4Pv0gW99a/kX/vbbx4yd67oQ4te/Hgll9OjKTghPPhnlshJoPwAlBBGBWEP6qqvg5JNjgaBFi5b/rn172Hbb+KL/xjdW/NLfcst1/9Jfkw4d4Igj4O674dprK7dsVFcX73e/fmlHAighiEij730P3noL5s9f8Uu/W7dYn7rYamtj/qXHH48lTytRJhPtKCWyMJMSgogEM/jFL9KOYrmDD471oe+4ozITwttvw7RpcNJJzR9bJBqYJiKlqX37KBuNGbNiCatSPPFEPJZI+wEoIYhIKauthQULomxUaTKZmJZ8t93SjuS/lBBEpHQddFB8aY4enXYk+ZfJwFe/mk77zGooIYhI6WosG91zT2WVjV5/PbrwllC5CJQQRKTU1dbC++/H+IhKUVcXjyUwoV02JQQRKW2VWDbKZKBLF9hll7QjWUFOCcHMBprZDDObZWbDm/j9qWY2xcwmm9k4M+ud7D/IzCYmv5toZgdkPWePZP8sM7vSrBCjW0Sk7LVrB0cdFWWj7FHS5co9EsKAAdCqtP4mbzYaM2sNjAAOAXoDxzV+4WcZ6e593L0vcBlwRbL/XeCb7t4HOAn4W9ZzrgUGA72SnwrsaCwieVFbGzOgPvpo2pG03OzZMb13iZWLILc7hD2BWe7+qrsvAkYBR2Qf4O4fZm12BDzZP8nd30z2TwU6mFl7M9sC2NDdn3V3B24BjmzhaxGRSnXggTEL6h13pB1Jy2Uy8VhiDcqQ20jlrYC5WdsNwF4rH2RmQ4CfA+2Apl7p0cAkd//czLZKzpN9zq1yDVpEqkxj2eiuu6Js1L592hGtu0wm5oDafvu0I1lFLncITdX2fZUd7iPcfVvgbOC8FU5gtjPwO+BHa3PO5LmDzazezOrnz5+fQ7giUpFqa2O5yUceSTuSdecePYwGDCjMpIAtlEtCaACyJ0jvBry5mmMhSkr/Lf+YWTdgDHCiu8/OOme3XM7p7te7e42713Tt2jWHcEWkIh14IGy8cXn3Npo2DebNK8lyEeSWECYAvcysp5m1A44FxmYfYGa9sjYPBWYm+zsD9wPnuPv4xgPc/S3gIzPbO+lddCJwb4teiYhUtrZto2w0dmysw1yOSrj9AHJICO6+BBgKPAxMB0a7+1Qzu8jMDk8OG2pmU81sMtGO0Dh931BgO+D8pEvqZDPbLPndj4G/ALOA2cCDeXtVIlKZyr1sVFcHPXrETwky9yZL9yWppqbG6+vr0w5DRNKyeDF84Qux/vOtt6YdzdpZuhS6do27nBtuKOqlzWyiu9c0d1xpjYoQEVmTtm1jqc5774VPP007mrXz4osxc2uJlotACUFEyk1tLSxcCA8/nHYka6dE5y/KpoQgIuVlwADYdNPyG6SWycAOO8QYhBKlhCAi5aVNmygbjR1bPmWjxYvhqadKulwESggiUo4ay0YPPZR2JLmZODHiLeFyESghiEg52n//mD66XAapNY4/2H//VMNojhKCiJSfNm3g6KPhvvvKo2yUycCXvhTdTkuYEoKIlKdBg+Djj+HBEh/T+vnnMH58ybcfgBKCiJSrr341/uIu9bLRc8/FVBsl3n4ASggiUq6yy0affJJ2NKuXycTKaPvtl3YkzVJCEJHyVVsbyeCBB9KOZPUyGdhjj1jgp8QpIYhI+dpvP9hss9IdpPbxx/D882VRLgIlBBEpZ61bR9noH/+IL99SM358DEorgwZlUEIQkXJXymWjTCbaOvbZJ+1IcqKEICLlbd99YfPNS7O3UV0d7LUXdOyYdiQ5UUIQkfLWujUccwzcf39plY0++ADq68umXARKCCJSCQYNihHL99+fdiTLPfUULFumhCAiUlT77BMrqZVS2aiuDtq3h733TjuSnCkhiEj5yy4bLVyYdjQhk4H+/aFDh7QjyZkSgohUhtramCLiH/9IOxJ4991YMrOMykWghCAilaJ/f9hii9IoGz35ZDyWyYC0RkoIIlIZWrWKstGDD8JHH6UbSyYTXU379Us3jrWUU0Iws4FmNsPMZpnZ8CZ+f6qZTTGzyWY2zsx6J/s3NbM6M1toZlev9JwnknNOTn42y89LEpGqVSplo0wmptVo2zbdONZSswnBzFoDI4BDgN7AcY1f+FlGunsfd+8LXAZckez/DDgfOGM1p/+Ou/dNfuat0ysQEWn0la/EIvZplo3eegtefrnsykWQ2x3CnsAsd3/V3RcBo4Ajsg9w9w+zNjsCnuz/2N3HEYlBRKSwWrWKMQkPPggfftj88YVQVxePZdagDLklhK2AuVnbDcm+FZjZEDObTdwhDMvx+n9NykXnm5nl+BwRkdUbNChWKUurbJTJxFTXffumc/0WyCUhNPVF7avscB/h7tsCZwPn5XDe77h7H2Df5OeEJi9uNtjM6s2sfv78+TmcVkSq2pe/DFttlV7ZKJOJ1dxat07n+i2QS0JoALpnbXcD3lzD8aOAI5s7qbu/kTx+BIwkSlNNHXe9u9e4e03XEl+gWkRKQJplo9degzlzyrJcBLklhAlALzPraWbtgGOBsdkHmFmvrM1DgZlrOqGZtTGzLsm/2wKHAS+tTeAiIqtVWwuLFsHYsc0fm09l3H4A0Ka5A9x9iZkNBR4GWgM3uvtUM7sIqHf3scBQM/sasBhYAJzU+Hwzew3YEGhnZkcCBwP/Bh5OkkFr4DHgz3l9ZSJSvfbaC7p3j7LRd79bvOtmMtC1K+y8c/GumUfNJgQAd38AeGClfb/M+vfpa3huj9X8ao9cri0istYaB6mNGBHTUG+0UeGv6R53CAMGQJn2kdFIZRGpTMUuG82cCW+8UbblIlBCEJFKtddesPXWxettlMnEoxKCiEiJMYveRg8/DO+/X/jr1dVFd9fttiv8tQpECUFEKldtLSxeXPiy0bJlkRAOOKBs2w9ACUFEKlm/fvDFLxa+bDR1KsyfX9blIlBCEJFK1lg2euQRWLCgcNdpHH9QhhPaZVNCEJHK1lg2uvfewl0jk4Fttom7kTKmhCAila2mBnr0KFzZaOlSeOKJsi8XgRKCiFS6xrLRo48Wpmw0eXIMfivzchEoIYhINaithSVL4J578n/uxvEHSggiImVgjz2gZ8/ClI0yGdhpJ9hii/yfu8iUEESk8pnFXcJjj8F77+XvvIsXw9PGlkBFAAAJ9ElEQVRPV8TdASghiEi1KETZaMIE+PjjimhQBiUEEakWu+0WXUPvuCN/52xsP9h///ydM0VKCCJSHQpRNspkYNddYdNN83O+lCkhiEj1qK2NcQNjxrT8XJ99Bs88UzHlIlBCEJFq0rdvzEaaj95Gzz4Ln3+uhCAiUpYay0aZDLz7bsvOlcnEymz77puf2EqAEoKIVJdBg/JTNqqri2kxirE8Z5EoIYhIddl1V+jVq2Vlo4UL4fnnK6pcBEoIIlJtsstG8+ev2znGjYsxDRUyIK2REoKIVJ/a2ljl7O671+35dXXQti3075/fuFKWU0Iws4FmNsPMZpnZ8CZ+f6qZTTGzyWY2zsx6J/s3NbM6M1toZlev9Jw9kufMMrMrzcp43TkRKS99+sD226/7ILVMBvbeGzp2zG9cKWs2IZhZa2AEcAjQGziu8Qs/y0h37+PufYHLgCuS/Z8B5wNnNHHqa4HBQK/kZ+A6vQIRkbXVWDaqq4N589buue+/Dy+8UHHlIsjtDmFPYJa7v+rui4BRwBHZB7j7h1mbHQFP9n/s7uOIxPBfZrYFsKG7P+vuDtwCHLnuL0NEZC2ta9noqafieRXWoAy5JYStgLlZ2w3JvhWY2RAzm03cIQzL4ZwNzZ1TRKRgdtkFdtxx7XsbZTLQoUOUjCpMLgmhqdq+r7LDfYS7bwucDZyXj3MCmNlgM6s3s/r569ojQERkZY1loyefhHfeyf15mQzssw+0b1+42FKSS0JoALpnbXcD3lzD8aNovvzTkJyn2XO6+/XuXuPuNV27ds0hXBGRHA0atHZlo/nzYcqUimw/gNwSwgSgl5n1NLN2wLHA2OwDzKxX1uahwMw1ndDd3wI+MrO9k95FJwL3rlXkIiIttfPOsdpZrmWjJ56IxwpsP4AcEoK7LwGGAg8D04HR7j7VzC4ys8OTw4aa2VQzmwz8HDip8flm9hrR6+h7ZtaQ1UPpx8BfgFnAbODBPL0mEZHcZJeN3n67+eMzGejUKaasqEAWnXzKQ01NjdfX16cdhohUkqlTo4H56qthyJA1H7vDDjFb6v33Fye2PDGzie7ebBbTSGURqW477xw/zQ1Se+MNeOWVii0XgRKCiEg0Lj/1FLz11uqPqauLRyUEEZEKNmgQuMNdd63+mEwGNt44ZkutUEoIIiK9e0c7wpp6G9XVwf77x6I4FapyX5mIyNqorY1prd9sYkjUnDnw2msVXS4CJQQRkbCmslEmE48VOiCtkRKCiAjEvEZ9+jRdNqqrg802i9JSBVNCEBFp1Fg2euON5fvc4w7hgANiIFsFU0IQEWk0aFA83nnn8n0zZkR31AovF4ESgojIcjvsEN1KswepVcH4g0ZKCCIi2QYNgvHjoSFZsiWTge7dYdtt042rCJQQRESyZZeNli2LO4QBAyq+/QCUEEREVrT99tC3b/Q2eukleO+9qigXgRKCiMiqamvh2Wfh5ptjuwoalEEJQURkVY1loyuvjLaDrbdON54iUUIQEVnZdtvBbrvBkiVVUy4CJQQRkabV1sZjFSWENmkHICJSkk45Bd55Bw47LO1IikYJQUSkKV26wB//mHYURaWSkYiIAEoIIiKSUEIQERFACUFERBI5JQQzG2hmM8xslpkNb+L3p5rZFDObbGbjzKx31u/OSZ43w8y+nrX/tazn1Ofn5YiIyLpqtpeRmbUGRgAHAQ3ABDMb6+7Tsg4b6e7XJccfDlwBDEwSw7HAzsCWwGNmtr27L02eN8Dd383fyxERkXWVyx3CnsAsd3/V3RcBo4Ajsg9w9w+zNjsCnvz7CGCUu3/u7nOAWcn5RESkxOSSELYC5mZtNyT7VmBmQ8xsNnAZMCyH5zrwiJlNNLPBq7u4mQ02s3ozq58/f34O4YqIyLrIZWBaU5OA+yo73EcAI8zseOA84KRmntvf3d80s82AR83sZXd/qonzXg9cD2Bm883s3znEXMq6ACqTBb0XK9L7sSK9H8u19L34Yi4H5ZIQGoDuWdvdgDfXcPwo4NrmnuvujY/zzGwMUUpaJSFkc/euOcRb0sys3t1r0o6jFOi9WJHejxXp/ViuWO9FLiWjCUAvM+tpZu2IRuKx2QeYWa+szUOBmcm/xwLHmll7M+sJ9AL+aWYdzaxT8tyOwMHASy17KSIi0hLN3iG4+xIzGwo8DLQGbnT3qWZ2EVDv7mOBoWb2NWAxsIAoF5EcNxqYBiwBhrj7UjPbHBhjsSRdG6KX0kMFeH0iIpIjc1+lOUAKyMwGJ+0iVU/vxYr0fqxI78dyxXovlBBERATQ1BUiIpJQQigQM+tuZnVmNt3MpprZ6cn+TczsUTObmTxunHasxWRmrc1skpn9I9nuaWbPJ+/H7UnHhYpnZp3N7E4zezn5jHy5mj8bZvaz5P+Tl8zsNjPrUE2fDTO70czmmdlLWfua/DxYuDKZEuhfZrZ7vuJQQiicJcD/uPtOwN7AkGQqj+HA4+7eC3g82a4mpwPTs7Z/B/wxeT8WAD9IJari+z/gIXffEdiVeE+q8rNhZlsRg1lr3H0XovPKsVTXZ+MmYOBK+1b3eTiE6LHZCxjM8m7+LaaEUCDu/pa7v5D8+yPif/itiOk8bk4Ouxk4Mp0Ii8/MuhHdkv+SbBtwAHBnckhVvB9mtiGwH3ADgLsvcvf3qeLPBtHbcD0zawOsD7xFFX02kkG5/1lp9+o+D0cAt3h4DuhsZlvkIw4lhCIwsx7AbsDzwObu/hZE0gA2Sy+yovtf4CxgWbK9KfC+uy9JtpucFqUCbQPMB/6alM/+kozHqcrPhru/AfwBeJ1IBB8AE6nOz0a21X0ecppOaF0oIRSYmW0A3AX8dKVJAKuKmR0GzHP3idm7mzi0Grq9tQF2B651992Aj6mS8lBTktr4EUBPYlbkjkRZZGXV8NnIRcH+v1FCKCAza0skg7+7+93J7ncab++Sx3lpxVdk/YHDzew1YnqTA4g7hs5JmQCanxalUjQADe7+fLJ9J5EgqvWz8TVgjrvPd/fFwN3AV6jOz0a21X0e1nY6oZwpIRRIUh+/AZju7ldk/WosyUju5PHeYseWBnc/x927uXsPosEw4+7fAeqAY5LDquL9cPe3gblmtkOy60BiNH9VfjaIUtHeZrZ+8v9N4/tRdZ+Nlazu8zAWODHpbbQ38EFjaamlNDCtQMxsH+BpYArLa+a/INoRRgNbE/8jDHL3lRuTKpqZ7Q+c4e6Hmdk2xB3DJsAk4Lvu/nma8RWDmfUlGtfbAa8C3yf+QKvKz4aZ/Qr4NtE7bxJwClEXr4rPhpndBuxPzGr6DnABcA9NfB6SpHk10SvpE+D77p6XVSeVEEREBFDJSEREEkoIIiICKCGIiEhCCUFERAAlBBERSSghiIgIoIQgIiIJJQQREQHg/wEX/Xwbsf6ugAAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x25c852ee7b8>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.plot(Ks,np.array(CH_scores),\"r-\")#只看内部评价指标，外部不看了"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "state：1.是在K = 30的时候，CH_score最高\n",
    "       2.是在K = 100的时候，v_score最高"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 194,
   "metadata": {},
   "outputs": [],
   "source": [
    "# n_clusters = 30\n",
    "# mbKmeans = MiniBatchKMeans(n_clusters = 30)\n",
    "# mbKmeans.fit(X_train_pca)\n",
    "\n",
    "# y_train_pred = mbKmeans.labels_#预测的结果   #没有把预测结果都打出来，打出来就太多了，\n",
    "# cents = mbKmeans.cluster_centers_#质心\n",
    "\n",
    "# for i in range(n_clusters):\n",
    "#     index = np.nonzero(y_train_pred==i)[0]\n",
    "#     x1 = X_train_pca[index,0]\n",
    "#     x2 = X_train_pca[index,1]\n",
    "#     y_i = y_train[index]\n",
    "#     for j in range(len(x1)):\n",
    "#         if j < 2:  #每类打印2个\n",
    "#             plt.text(x1[j],x2[j],str(int(y_i[j])))\n",
    "# plt.show()\n",
    "# ValueError: Image size of 472033954x-772189389 pixels is too large. It must be less than 2^16 \n",
    "#             in each direction."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
